This is the final assignment for the open data science course, using data from the United Nations Development Programme. The focus of this assignment is on expected years of education: I perform linear discriminant analysis (LDA) to find out which variables determine the level of education. In addition, LDA results with expected years of education as the target variable are compared to LDA with clusters as target variables. Both approaches find that mothers' mortality rate plays a significant role, with life expectancy and mean years of education as additional determinants of expected years of education.
This is a sample of the progress of a Bachelor level historian developing her skills in statistics with R-software. I have statistics as a minor and just started with the R programming language. My message to all of you is: if I can do this, so can you. I'm not saying it's easy, but I'm saying it's possible.
This course has been an introductory course to R-software using R-studio, Github and DataCamp. Through these weeks we have learned how to use R.markdown files and how to push changes from them to an html web page with Github. We have also briefly learned different data wrangling and analysing techniques with DataCamp: link
What I’ve done so far in this course can be seen at my course diary: link. We have had weekly assignments regarding a different data analysing method each week.
This is the final assignment for the course introduction to open data science. In this assignment we create our own web pages from scratch (oh yes, you're reading one) and perform statistical analysis on the data chosen.
At my Github repository anyone can see the R.markdown codes for this web page. Basically, it means that if you have Rstudio and a Github username, YOU can make your own web page, too. How cool is that! The link to my Github repository is, as promised: link
If you keep up with me, you will see how real life data is wrangled so that classes and groups from the data can be predicted and visualized. I’ll do my best to explain every step adequately.
I'm going to use the human data from my previous exercises. The dataset originates from the United Nations Development Programme, and you can read more about it here: link.
I find this data the most interesting one we've used, since it's actual data that describes the wellbeing of nations in a very multidimensional way. Instead of just using Gross National Product or a similar index as an explanatory variable, we can take inequality or education levels as explanatory variables when explaining the wellbeing of a state. This way we do not compare nations simply based on their wealth, but based on their human capacity.
I have created a subset called "human2", since we have already used the human data previously. The data consists of 10 variables and 155 observations. The variables modified from the original dataset are Country, Edu2FM, LabFM and GNI. All variables are numeric and continuous. For further information about the modifications made, you can check my Github repository: link
#Bringing data to R.markdown + summaries
getwd()## [1] "C:/Users/murmeli/Documents/murmeli/2017/GitHub/IODS-final"
human2 <- read.csv("data/human2.csv", sep=",", header = TRUE, row.names = 1)
str(human2)## 'data.frame': 155 obs. of 10 variables:
## $ HDI : num 0.944 0.935 0.93 0.923 0.922 0.916 0.916 0.915 0.913 0.913 ...
## $ Edu2FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ LabFM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 166 135 156 139 140 137 127 154 134 117 ...
## $ Mot.mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Adol.birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parl.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ Edu.y.mean: num 12.6 13 12.8 12.7 11.9 13.1 12.2 12.9 13 12.5 ...
dim(human2)## [1] 155 10
head(human2)## HDI Edu2FM LabFM Edu.exp Life.exp GNI Mot.mor
## Norway 0.944 1.0072389 0.8908297 17.5 81.6 166 4
## Australia 0.935 0.9968288 0.8189415 20.2 82.4 135 6
## Switzerland 0.930 0.9834369 0.8251001 15.8 83.0 156 6
## Denmark 0.923 0.9886128 0.8840361 18.7 80.2 139 5
## Netherlands 0.922 0.9690608 0.8286119 17.9 81.6 140 6
## Germany 0.916 0.9927835 0.8072289 16.5 80.9 137 7
## Adol.birth Parl.F Edu.y.mean
## Norway 7.8 39.6 12.6
## Australia 12.1 30.5 13.0
## Switzerland 1.9 28.5 12.8
## Denmark 5.1 38.0 12.7
## Netherlands 6.2 36.9 11.9
## Germany 3.8 36.9 13.1
The data wrangling done here is strongly based on the original ch4 workout. In addition, I have added the variables HDI and mean years of education. I hope these additions provide more background for evaluating the connections. Later on I will continue data wrangling for the analysis part.
Underneath are the names of the 10 variables and their descriptions.
Country: used as row names, regions not included
HDI: Human Development Index
Edu2FM: ratio of females to males with at least secondary education
LabFM: ratio of females to males participating in the labour force
Edu.exp: expected years of education
Life.exp: life expectancy at birth
GNI: Gross National Income
Mot.mor: mothers' (maternal) mortality rate
Adol.birth: adolescent birth rate
Parl.F: percentage of female representatives in parliament
Edu.y.mean: mean years of education/ Country
In addition, I have excluded missing values by removing the incomplete observations from the data set.
#Complete cases, without missing values
complete.cases(human2)## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [99] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [113] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [141] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE
#Packages we may need in this workout
library(GGally)
library(dplyr)##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
library(FactoMineR)
library(ggplot2)
library(tidyr)
library(MASS)##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Let's compute a correlation matrix and visualize it with a plot in order to see the connections between the variables. After that we can make some preliminary assumptions about our results.
cor_human<-cor(human2) %>% round(digits=2)
cor_human ## HDI Edu2FM LabFM Edu.exp Life.exp GNI Mot.mor Adol.birth
## HDI 1.00 0.69 -0.06 0.92 0.90 0.18 -0.85 -0.78
## Edu2FM 0.69 1.00 0.01 0.59 0.58 0.18 -0.66 -0.53
## LabFM -0.06 0.01 1.00 0.05 -0.14 -0.01 0.24 0.12
## Edu.exp 0.92 0.59 0.05 1.00 0.79 0.12 -0.74 -0.70
## Life.exp 0.90 0.58 -0.14 0.79 1.00 0.19 -0.86 -0.73
## GNI 0.18 0.18 -0.01 0.12 0.19 1.00 -0.10 -0.13
## Mot.mor -0.85 -0.66 0.24 -0.74 -0.86 -0.10 1.00 0.76
## Adol.birth -0.78 -0.53 0.12 -0.70 -0.73 -0.13 0.76 1.00
## Parl.F 0.17 0.08 0.25 0.21 0.17 0.00 -0.09 -0.07
## Edu.y.mean 0.90 0.67 0.08 0.83 0.73 0.23 -0.73 -0.69
## Parl.F Edu.y.mean
## HDI 0.17 0.90
## Edu2FM 0.08 0.67
## LabFM 0.25 0.08
## Edu.exp 0.21 0.83
## Life.exp 0.17 0.73
## GNI 0.00 0.23
## Mot.mor -0.09 -0.73
## Adol.birth -0.07 -0.69
## Parl.F 1.00 0.17
## Edu.y.mean 0.17 1.00
#Visualize correlation matrix with a plot
title <- "Correlations between variables in human data"
corrplot(cor_human, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6, title = title, mar=c(0,0,1,0))
Figure 1. Correlations between variables in human data
In figure 1 we see that HDI has the highest correlations of all, but that is expected since it is a composite of many of the other variables. This shows that HDI works as it is supposed to: it combines variables concerning a long and healthy life and a long education. In addition, it reveals that not taking care of mothers influences general human development negatively.
More precisely, a high mothers' mortality rate goes together with lower life expectancy in general. Its correlation with expected years of education is not as strong as HDI's, but quite strong nevertheless (-0.74): when more mothers die, the expected years of education drop, too. For the subgroup of adolescent mothers (the adolescent birth rate), the interpretation is similar.
Based on these correlations I can assume that HDI, mothers' mortality rate and adolescent birth rate determine the expected years of education. In other words: if a state can give its citizens a long and healthy life, it increases the level of education, too.
Our data gives a broad view of human capacity, and in my opinion education is one of the best ways to increase it. If we want to use our potential human capacity fully, we need to find out what structures determine the level of education. It is in the hands of politicians to decide what the best level of education is, but it is the work of science to find out why some people have a certain level of education.
Because I'm interested in expected levels of education, I believe a classification method is the best analysing tool for this. I can divide the levels of education into groups and find out which variables determine the group. I don't aim to fully explain the reasons behind a given education level, but with our data we can use Linear Discriminant Analysis (LDA) to find out which variables determine the education groups in our data.
But first we need to wrangle the data in order to make analysis with LDA possible. Let's find out what we need to do by looking at the summary of the data.
Summary of the data
summary(human2)## HDI Edu2FM LabFM Edu.exp
## Min. :0.3480 Min. :0.1717 Min. :0.1857 Min. : 5.40
## 1st Qu.:0.5925 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25
## Median :0.7330 Median :0.9375 Median :0.7535 Median :13.50
## Mean :0.7039 Mean :0.8529 Mean :0.7074 Mean :13.18
## 3rd Qu.:0.8290 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20
## Max. :0.9440 Max. :1.4967 Max. :1.0380 Max. :20.20
## Life.exp GNI Mot.mor Adol.birth
## Min. :49.00 Min. : 2.00 Min. : 1.0 Min. : 0.60
## 1st Qu.:66.30 1st Qu.: 53.50 1st Qu.: 11.5 1st Qu.: 12.65
## Median :74.20 Median : 99.00 Median : 49.0 Median : 33.60
## Mean :71.65 Mean : 98.73 Mean : 149.1 Mean : 47.16
## 3rd Qu.:77.25 3rd Qu.:143.50 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :83.50 Max. :194.00 Max. :1100.0 Max. :204.80
## Parl.F Edu.y.mean
## Min. : 0.00 Min. : 1.400
## 1st Qu.:12.40 1st Qu.: 6.000
## Median :19.30 Median : 8.500
## Mean :20.91 Mean : 8.302
## 3rd Qu.:27.95 3rd Qu.:10.900
## Max. :57.50 Max. :13.100
Our variables have values of very different magnitudes, from under 1 up to 1100. A classification method is interested in the distances between the points in the data, and in order to make these distances comparable, we need to scale the data.
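What scaling does can be sketched by hand: each column is centered by its mean and divided by its standard deviation. A minimal sketch on a toy vector (not our actual data), showing that the manual computation matches R's scale():

```r
# Sketch: standardizing a variable by hand, so it has mean 0 and sd 1.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
# scale() does the same computation column-wise and returns a matrix
auto <- as.vector(scale(x))
all.equal(manual, auto)  # TRUE
```

After this transformation the distances between observations are no longer dominated by the variables with the largest raw values.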
Distribution
#Draw a density plot of each variable
gather(human2) %>% ggplot(aes(value,fill="#99999",alpha=0.3)) + facet_wrap("key", scales = "free") +
geom_density() +
ggtitle("Density plots of the variables in the human data") +
theme(plot.title = element_text(hjust = 0.5,size=16,face='bold'),legend.position="none")
Figure 2. Density plots of the variables in the human data
LDA assumes that the variables are normally distributed. In figure 2 we see that mothers' mortality rate and adolescent birth rate are both right-skewed: most of the mass is on the left, with a long tail to the right. All others are reasonably normally distributed.
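The direction of the skew can also be checked numerically. A sketch using the standardized third central moment (the `skewness` helper here is my own definition, not a base R function; positive values indicate a right-skewed distribution):

```r
# Sketch: sample skewness as the standardized third central moment.
# Positive = right-skewed (long right tail), negative = left-skewed.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(42)
skewness(rexp(1000))   # clearly positive, as for the mortality and birth-rate variables
skewness(rnorm(1000))  # near zero: roughly symmetric
```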
scaled human
# center and standardize variables
human_scaled <- scale(human2)
# summaries of the scaled variables
summary(human_scaled)## HDI Edu2FM LabFM Edu.exp
## Min. :-2.2981 Min. :-2.8189 Min. :-2.6247 Min. :-2.7378
## 1st Qu.:-0.7194 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782
## Median : 0.1878 Median : 0.3503 Median : 0.2316 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8076 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126
## Max. : 1.5502 Max. : 2.6646 Max. : 1.6632 Max. : 2.4730
## Life.exp GNI Mot.mor Adol.birth
## Min. :-2.7188 Min. :-1.767729 Min. :-0.6992 Min. :-1.1325
## 1st Qu.:-0.6425 1st Qu.:-0.826563 1st Qu.:-0.6496 1st Qu.:-0.8394
## Median : 0.3056 Median : 0.004952 Median :-0.4726 Median :-0.3298
## Mean : 0.0000 Mean : 0.000000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6717 3rd Qu.: 0.818192 3rd Qu.: 0.1932 3rd Qu.: 0.6030
## Max. : 1.4218 Max. : 1.741082 Max. : 4.4899 Max. : 3.8344
## Parl.F Edu.y.mean
## Min. :-1.8203 Min. :-2.19845
## 1st Qu.:-0.7409 1st Qu.:-0.73323
## Median :-0.1403 Median : 0.06309
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6127 3rd Qu.: 0.82755
## Max. : 3.1850 Max. : 1.52831
# class of the human_scaled object
class(human_scaled)## [1] "matrix"
#change the object to data frame
human_scaled <- as.data.frame(human_scaled)
class(human_scaled)## [1] "data.frame"
Now the values of all variables lie roughly between -3 and 4.5, on a comparable scale. The object is also transformed back to a data frame, so further analysis can be done.
Now all we need is the target variable: we divide the expected years of education into four classes based on its quantiles.
Factor variable
# save the Edu.exp as scaled_eduexp
scaled_eduexp <- human_scaled$Edu.exp
# summary of the scaled_eduexp
summary(scaled_eduexp)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.7378 -0.6782 0.1140 0.0000 0.7126 2.4730
# create a quantile vector of eduexp and print it
bins <- quantile(scaled_eduexp)
bins## 0% 25% 50% 75% 100%
## -2.7378319 -0.6781546 0.1140290 0.7125677 2.4729757
# create a new categorical variable 'eduexp'
eduexp <- cut(scaled_eduexp, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor
table(eduexp)## eduexp
## low med_low med_high high
## 39 42 36 38
# remove original Edu.exp from the dataset
human_scaled <- dplyr::select(human_scaled, -Edu.exp)
# add the new categorical value to scaled data
human_scaled <- data.frame(human_scaled, eduexp)
str(human_scaled)## 'data.frame': 155 obs. of 10 variables:
## $ HDI : num 1.55 1.49 1.46 1.41 1.41 ...
## $ Edu2FM : num 0.639 0.596 0.54 0.562 0.481 ...
## $ LabFM : num 0.923 0.561 0.592 0.888 0.61 ...
## $ Life.exp : num 1.19 1.29 1.36 1.03 1.19 ...
## $ GNI : num 1.229 0.663 1.047 0.736 0.754 ...
## $ Mot.mor : num -0.685 -0.676 -0.676 -0.68 -0.676 ...
## $ Adol.birth: num -0.957 -0.853 -1.101 -1.023 -0.996 ...
## $ Parl.F : num 1.627 0.835 0.661 1.488 1.392 ...
## $ Edu.y.mean: num 1.37 1.5 1.43 1.4 1.15 ...
## $ eduexp : Factor w/ 4 levels "low","med_low",..: 4 4 4 4 4 4 4 4 4 4 ...
Now we have a new categorical variable "eduexp" instead of the continuous "Edu.exp" in our scaled data set. It has the levels low, med_low, med_high and high, based on the quantiles of the expected years of education in our real data. The top 25% belongs to the high class and the bottom 25% to the low class.
In order to predict the education groups in the original data, we need a training set to create the model for the 4 groups. We also need a test set to find out how well our model actually predicts those groups. The model is created with the train set and tested with the test set.
We'll split the data so that a randomly selected 80% of the rows is used for training and the remaining 20% for testing. We also need to save and then exclude the correct classes from the test set: since that is what we compare the predictions to, it cannot be among the predictors.
Test and train
# number of rows in the human dataset
n <- nrow(human_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- human_scaled[ind,]
# create test set
test <- human_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$eduexp
summary(correct_classes)## low med_low med_high high
## 8 8 9 6
# remove the eduexp variable from test data
test <- dplyr::select(test, -eduexp)
Now we're ready to use Linear Discriminant Analysis (LDA) to separate the data into groups. We will use the train set to find the variables that best separate our target variable, and the test set to check how well the model predicts the classes.
In our LDA we have the categorical expected years of education as the target variable against all other variables. We use our train set to create the model: it will predict expected years of education using all other variables. (The function for the arrows in the plot was provided in our DataCamp exercises.)
Model 1
lda.fit <- lda(formula= eduexp ~ ., data = train)
# print the lda.fit object
lda.fit## Call:
## lda(eduexp ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2741935 0.2177419 0.2580645
##
## Group means:
## HDI Edu2FM LabFM Life.exp GNI
## low -1.3326860 -0.96297729 0.3005164 -1.1400189 -0.3304838
## med_low -0.1622081 -0.01938138 -0.1822659 -0.1116193 0.0834272
## med_high 0.4384091 0.44880462 -0.4161161 0.3434043 -0.2089339
## high 1.0981996 0.48232138 0.2948259 0.9567200 0.3538913
## Mot.mor Adol.birth Parl.F Edu.y.mean
## low 1.1970798 1.14806616 -0.01308547 -1.2151285
## med_low -0.1623214 0.04450096 -0.15590741 -0.1402062
## med_high -0.5272991 -0.44862502 -0.36565065 0.4311643
## high -0.6336896 -0.78047273 0.45899014 0.9639225
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## HDI 3.39414899 1.01672037 -0.1520833
## Edu2FM -0.35151307 -0.22608818 -0.6607163
## LabFM 0.03276751 0.33024339 0.4676654
## Life.exp -0.40482640 0.91857344 -0.5532308
## GNI 0.04573263 0.08209861 0.8414125
## Mot.mor 0.21302772 2.00452609 -1.9467411
## Adol.birth -0.10774944 -0.11344636 0.2262729
## Parl.F -0.04027252 0.28160124 0.3979489
## Edu.y.mean -0.19798605 -0.29957565 -0.5590494
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9146 0.0709 0.0145
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$eduexp)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes, main = "Figure 3. LDA model 1, classes")
lda.arrows(lda.fit, myscale = 1)
# another plot, with the arrows showing clearly
plot(lda.fit, dimen=2, cex = c(0.1), main = "Figure 4. LDA model 1, arrows")
lda.arrows(lda.fit, myscale = 2)
Our plot for the LDA is actually also a dimensionally reduced way of picturing the results. It projects our nine predictor variables onto 2 discriminant components: LD1 and LD2. LD1 captures about 91% of the between-group variance and LD2 about 7%.
I have created two plots of the model: figure 3 emphasizes how well the groups are separated by giving each group a different colour, and figure 4 emphasizes the variables as arrows. The longer the arrow and the narrower the angle between an axis and an arrow, the stronger the connection between the LD component and the variable. The more separated the groups, the better the model.
In figure 3 we can see that the high end and the low end are best separated. In figure 4 we notice that HDI has the strongest connection to LD1 and so captures the largest share of the variance between our groups of expected years of education. In other words, HDI is the strongest variable separating the groups. Mothers' mortality rate is the second strongest explanatory variable and is strongly connected to LD2.
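The "Proportion of trace" figures printed in the model output can be recovered from the singular values stored in the fitted object. A sketch on the built-in iris data (the same computation applies to our own lda.fit; the names `fit` and `prop_trace` are illustrative):

```r
library(MASS)

# Sketch: print.lda()'s "Proportion of trace" is the squared singular
# values of the fit, normalised to sum to one.
fit <- lda(Species ~ ., data = iris)
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # LD1 dominates, as in our models
```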
But since HDI is a composite variable built from already existing variables, I want to test what can be seen without it. That is why I exclude it in the second model, so it won't cloud the interpretation of the results. So here we do just as before, but without HDI.
Data without HDI
human_scaled2 <-human_scaled
human_scaled2 <- dplyr::select(human_scaled2, -HDI)
Test and training set without HDI
# number of rows in the human dataset
n2 <- nrow(human_scaled2)
# choose randomly 80% of the rows
ind2 <- sample(n2, size = n2 * 0.8)
# create train set
train2 <- human_scaled2[ind2,]
# create test set
test2 <- human_scaled2[-ind2,]
# save the correct classes from test data
correct_classes2 <- test2$eduexp
summary(correct_classes2)## low med_low med_high high
## 7 6 5 13
# remove the eduexp variable from test data
test2 <- dplyr::select(test2, -eduexp)
Model 2, without HDI
lda.fit2 <- lda(formula= eduexp ~ ., data = train2)
# print the lda.fit object
lda.fit2## Call:
## lda(eduexp ~ ., data = train2)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2580645 0.2903226 0.2500000 0.2016129
##
## Group means:
## Edu2FM LabFM Life.exp GNI Mot.mor Adol.birth
## low -1.1190756 0.1733151 -1.2043593 -0.1909338 1.3265401 1.1336558
## med_low 0.1595313 -0.2034896 -0.1217777 0.1699351 -0.1375867 0.1475550
## med_high 0.4473930 -0.4340422 0.4298978 -0.1271001 -0.5545065 -0.4439637
## high 0.4829218 0.3434156 1.0238102 0.3996932 -0.6527416 -0.8212363
## Parl.F Edu.y.mean
## low -0.1868067 -1.2508343
## med_low -0.1777693 -0.1386448
## med_high -0.4842748 0.4340185
## high 0.6400200 0.9524158
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Edu2FM -0.033700937 -0.5308421 0.1910568
## LabFM 0.068409444 0.2333191 0.2622300
## Life.exp 0.983897047 1.0291586 -0.7800744
## GNI -0.078037722 0.1271690 0.6428329
## Mot.mor 0.103461601 1.2526326 -1.3857659
## Adol.birth -0.378301886 -0.1746635 0.4115476
## Parl.F -0.007098948 0.6585490 0.5981650
## Edu.y.mean 0.967372704 0.2448088 -0.5474758
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8464 0.1273 0.0263
# the function for lda biplot arrows
lda.arrows2 <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes2 <- as.numeric(train2$eduexp)
# plot the lda results
plot(lda.fit2, dimen = 2, col=classes2, pch=classes2, main = "Figure 5. LDA model 2, classes")
lda.arrows2(lda.fit2, myscale = 1)#another plot with arrows showing
plot(lda.fit2, dimen=2, cex = c(0.1), main = "Figure 6. LDA model 2, arrows")
lda.arrows2(lda.fit2, myscale = 2)
As we can see in figure 5, without HDI our groups are a bit more mixed up, especially the med_low and med_high levels of education. In figure 6 we notice that mothers' mortality rate is again very important, capturing the second largest share of the variance and helping to separate the education groups. Comparing with model 1, we see that mean years of education and life expectancy are the variables most closely aligned with LD1. This suggests that HDI's role in the first model was mainly based on years of education and life expectancy at birth.
So if a state already has a high level of education, it is expected to continue. There is no big surprise in that, so for our interpretation, mothers' wellbeing and general life expectancies are more interesting results, although statistically less valuable.
What is more visible in the second model is the negative relation between mothers' mortality and the female-to-male education ratio: the higher the mothers' mortality rate, the fewer women reach at least secondary education.
Since mothers' mortality has a much stronger connection to LD2 than Edu2FM, I'd say it is the wellbeing of mothers that counts in estimating both the expected years of education and the female-to-male education ratio, rather than the other way around. Gender inequality in education starts already when the mothers' mortality rate is high, and it affects education levels altogether. Unfortunately the mothers' mortality rate wasn't perfectly normally distributed; this might change the interpretation of the results a bit, and we need to treat them with caution.
Model 2 in 3D
model_predictors <- dplyr::select(train2, -eduexp)
# check the dimensions
dim(model_predictors)## [1] 124 8
dim(lda.fit2$scaling)## [1] 8 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit2$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train2$eduexp)
In figure 7 we can see the groups of our LDA model in 3D. Instead of the first 2 components it shows the first 3 components from model 2, which together capture all of the between-group variance. If you rotate the model you can see clear groups, with slight mixing at the edges of the group centres.
We have interpreted the results but are the results valid? This is the point where we need our test set to test how well our models predict the data not used for training.
Model 1
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)## predicted
## correct low med_low med_high high
## low 7 1 0 0
## med_low 0 4 4 0
## med_high 0 3 3 3
## high 0 0 1 5
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table()%>%addmargins()## predicted
## correct low med_low med_high high Sum
## low 0.22580645 0.03225806 0.00000000 0.00000000 0.25806452
## med_low 0.00000000 0.12903226 0.12903226 0.00000000 0.25806452
## med_high 0.00000000 0.09677419 0.09677419 0.09677419 0.29032258
## high 0.00000000 0.00000000 0.03225806 0.16129032 0.19354839
## Sum 0.22580645 0.25806452 0.25806452 0.25806452 1.00000000
Model 2
# predict classes with test data
lda.pred2 <- predict(lda.fit2, newdata = test2)
# cross tabulate the results
table(correct = correct_classes2, predicted = lda.pred2$class)## predicted
## correct low med_low med_high high
## low 6 1 0 0
## med_low 0 1 5 0
## med_high 0 1 1 3
## high 0 1 3 9
table(correct = correct_classes2, predicted = lda.pred2$class) %>% prop.table()%>%addmargins()## predicted
## correct low med_low med_high high Sum
## low 0.19354839 0.03225806 0.00000000 0.00000000 0.22580645
## med_low 0.00000000 0.03225806 0.16129032 0.00000000 0.19354839
## med_high 0.00000000 0.03225806 0.03225806 0.09677419 0.16129032
## high 0.00000000 0.03225806 0.09677419 0.29032258 0.41935484
## Sum 0.19354839 0.12903226 0.29032258 0.38709677 1.00000000
Model 1 with HDI predicts better, but in that version some variables are, in a way, included twice. That's why I believe the second model is more interesting: we can see more detailed results.
Also, even in the second case, our results are quite good at the low and high ends. The predicted class proportions stay close to the original even 25% split (the exact numbers vary, since the train/test split is drawn randomly every time). It seems that the same variables explain both the med_low and med_high education levels, which shows that our model isn't that good at predicting the middle classes.
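One convenient single number for comparing the models is the overall accuracy: the sum of the diagonal of the cross-tabulation divided by the total. A sketch with a toy confusion table (hypothetical values; our own tables of correct_classes vs lda.pred$class work exactly the same way):

```r
# Sketch: overall prediction accuracy from a confusion table.
correct   <- factor(c("low", "low", "high", "high", "med_low"))
predicted <- factor(c("low", "med_low", "high", "high", "med_low"),
                    levels = levels(correct))
conf <- table(correct = correct, predicted = predicted)
# correct predictions sit on the diagonal of the table
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # share of correctly classified observations
```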
I was also interested if our classification with expected years of education was reasonable. That’s why I decided to test clustering method as addition to our interpretation.
Clustering is an unsupervised method which divides the data into groups without knowing the classes beforehand. In this exercise we will use k-means, one of the most used clustering methods.
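As a minimal illustration of the idea (a toy example, not our data): k-means assigns each row to the nearest of k cluster centres and iterates the centres until the assignments stabilize.

```r
# Sketch: k-means on two well-separated toy point clouds.
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # cloud around (0, 0)
             matrix(rnorm(20, mean = 5), ncol = 2))  # cloud around (5, 5)
km_toy <- kmeans(toy, centers = 2)
table(km_toy$cluster)  # the two clouds come out as two clusters
```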
But first, we need to scale our data again in order to calculate the distances between observations properly. We do it without HDI, since we discovered that model 2 describes the data in more detail.
Summary of the scaled human3
human_scaled3 <- scale(human2)
human_scaled3 <- as.data.frame(human_scaled3)
human_scaled3 <- dplyr::select(human_scaled3, -HDI)
summary(human_scaled3)## Edu2FM LabFM Edu.exp Life.exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mot.mor Adol.birth Parl.F
## Min. :-1.767729 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.826563 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median : 0.004952 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.000000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.818192 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 1.741082 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
## Edu.y.mean
## Min. :-2.19845
## 1st Qu.:-0.73323
## Median : 0.06309
## Mean : 0.00000
## 3rd Qu.: 0.82755
## Max. : 1.52831
class(human_scaled3)## [1] "data.frame"
human_scaled3 <- as.data.frame(human_scaled3)
Because k-means clustering is an unsupervised method, we do not know the optimal number of clusters beforehand. That's why we look at the total within-cluster sum of squares (WCSS) to find the optimal number of clusters: it is the point where the line in figure 8 drops rapidly. In this case it is 2 clusters. With two clusters, the observations are closest to their cluster centres.
dist_eu <- dist(human_scaled3)
# look at the summary of the distances
summary(dist_eu)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2408 2.7319 3.7108 3.9185 4.9510 9.7109
library(ggplot2); library(GGally)
set.seed(123)
# euclidean distance matrix
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b', main="Results of the total within sum of squares")
Figure 8. Results of the total within sum of squares
Now let’s do kmeans clustering with 2 centers.
# k-means clustering
km <-kmeans(dist_eu, centers = 2)
human_scaled3$km <- as.factor(km$cluster)
# plot the human dataset with clusters
ggpairs(human_scaled3, ggplot2::aes(colour=km), title="Paired variables with clusters")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 9. Paired variables with clusters
If we compare the clusters in figure 9 to our previous classification results, we find that expected years of education correlates strongly with life expectancy (0.79), as in the classification, but most strongly with mean years of education (0.83). It also correlates with both mothers' mortality (-0.74) and adolescent births (-0.70), as in the classification method, although the adolescent birth rate has a stronger influence here than in the classification.
Although the optimal number of clusters was 2, I will test with the same number of groups as in our classification. I will perform LDA with 4 clusters, using the clusters as the target variable, and then compare the results with the classification method.
# k-means again
set.seed(123)
km2 <- kmeans(dist_eu, centers = 4)
# LDA with using the k-means clusters as target classes
human_scaled3$cl <- km2$cluster
lda.fit3 <- lda(cl ~ ., data = human_scaled3)
lda.fit3## Call:
## lda(cl ~ ., data = human_scaled3)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2451613 0.1870968 0.4193548 0.1483871
##
## Group means:
## Edu2FM LabFM Edu.exp Life.exp GNI Mot.mor
## 1 0.4786420 0.4357162 1.1582289 1.0197296 0.3887281 -0.6498738
## 2 -0.2204688 -0.1871010 -0.7649609 -0.7985475 0.1139721 0.3887345
## 3 0.4104487 -0.3680427 0.1687371 0.3021125 -0.1108838 -0.4647902
## 4 -1.6727810 0.5561516 -1.4259453 -1.5317027 -0.4725832 1.8970984
## Adol.birth Parl.F Edu.y.mean km2
## 1 -0.8494314 0.666748707 1.0681248 1.0000000
## 2 0.4928477 -0.078454736 -0.7991303 0.4137931
## 3 -0.2699624 -0.357109408 0.2747867 1.0000000
## 4 1.5449291 0.006558608 -1.5337000 0.0000000
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Edu2FM -0.200490376 -0.52269717 0.7120167
## LabFM 0.046086172 0.26933798 -0.1196263
## Edu.exp -0.449079030 0.71915089 0.5508557
## Life.exp -1.103612537 1.37128347 -0.6872693
## GNI 0.003798198 0.03719671 0.5919234
## Mot.mor -0.005812516 1.59910186 -1.2922396
## Adol.birth 0.393923388 -0.06607181 -0.1831562
## Parl.F 0.060019307 0.39975421 0.4098182
## Edu.y.mean -1.225084050 0.51799883 -0.4690829
## km2 -0.776749094 -1.86170282 -3.6715827
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8538 0.1176 0.0286
plot(lda.fit3, col=as.numeric(human_scaled3$cl), dimen=2, main = "Figure 10. LDA model 3")
lda.arrows(lda.fit3, myscale = 2, color = "#666666")
In our LDA model 3 we have a lot of similarities to models 1 and 2 from the classification method. Figure 10 shows that mothers' mortality rate is again almost exactly aligned with LD2 and captures the second largest share of the between-group variation in the data (after mean years of education). Also, mothers' mortality is negatively related to the ratio of women to men in higher education: when more women die, it is harder for women to reach higher education.
As we saw in our classification method, mothers' mortality, life expectancy and mean years of education are the most significant variables. When performing LDA with the clusters instead of the education levels as the target, we found that the same factors determine the groups. Clustering thus confirmed that our original separation based on education levels, and its key variables, was valid.
I conclude that mothers' mortality rate, life expectancy and mean years of education are the key variables in determining the expected years of education, as was expected. The adolescent birth rate did not stand out as an important factor, contrary to the hypothesis.
Both the mothers' mortality rate and the adolescent birth rate were somewhat right-skewed, so these results might not be as strong as suggested. With a bigger data sample these hypotheses could be tested further.
Surprisingly enough, mothers' mortality is a more important factor than gross national income. It seems that taking care of mothers matters more than money if we want the highest possible education. So in order to flourish, nations need to take care of their women.
Happy Women's Day!